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Abstract. The increasing number of processing elements and decreas- 
ing memory to core ratio in modern high-performance platforms makes 
efficient strong scaling a key requirement for numerical algorithms. In 
order to achieve efficient scalability on massively parallel systems scien- 
tific software must evolve across the entire stack to exploit the multiple 
levels of parallelism exposed in modern architectures. In this paper we 
demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise 
parallel sparse matrix-vector multiplication in PETSc, a widely used sci- 
entific library for the scalable solution of partial differential equations. 
Using large matrices generated by Fluidity, an open source CFD appli- 
cation code which uses PETSc as its linear solver engine, we evaluate the 
effect of explicit communication overlap using task-based parallelism and 
show how to further improve performance by explicitly load balancing 
threads within MPI processes. Wc demonstrate a significant speedup over 
the pure-MPI mode and efficient strong scaling of sparse matrix-vector 
multiplication on Fujitsu PRIMEHPC FXIO and Cray XE6 systems. 

Key words: PETSc, Hybrid MPI/OpenMP, strong scaling, task-based 
parallelism, hierarchical load balancing, sparse matrix-vector multiply 



1 Introduction 

Recent development in High Performance Computing (HPC) architectures has 
been driven by a clear trend towards greater numbers of lower power cores and 
a decreasing memory to core ratio. Numerical algorithms and scientific software 
have to adapt to these changes to efficiently utilise the available memory and 
network bandwidth. Hybrid programming techniques, where shared memory pro- 
gramming is combined with inter-node message passing, can be used to exploit 
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the multiple levels of parallelism inherent in modern architectures in order to 
achieve sustainable scalability on massively parallel systems. 

In this paper we describe the addition of OpenMP thread parallelism to the 
Portable Extensible Toolkit for Scientific Computation (PETSc) [5, 6]. PETSc is 
a widely used library for the scalable solution of partial differential equations and 
is often used as a key component of large scientific applications. Sparse matrix- 
vector multiplication (spMVM) is by far the most computationally expensive 
component of sparse iterative linear solvers [13]. Therefore we focus on optimis- 
ing spMVM within PETSc using hybrid programming techniques and evaluate 
strong scalability on large numbers of compute nodes. We demonstrate that us- 
ing task-based parallelism to hide communication latency can provide significant 
speedups over naive OpenMP parallelisation. Further, explicit thread-level load 
balancing can be used to give additional increases in performance, resulting in 
significantly improved scalability over pure-MPI implementations in the strong 
scaling limit. 

The matrices used for benchmarking our implementation are extracted from 
the open source, general-purpose, multi-phase computational fluid dynamics 
(CFD) code Fluidity [2]. Fluidity solves the Navier-Stokes equations and ac- 
companying field equations on arbitrarily unstructured finite-element meshes. 
It is used in areas including geophysical fluid dynamics, computational fluid 
dynamics and ocean modelling [10]. 

1.1 Sparse Matrix- Vector Multiplication 

PETSc offers a wide range of high-level components required for linear alge- 
bra, such as linear and non-linear solvers as well as preconditioners. These are 
based on a suite of parallel data structures which implement basic vector and 
matrix operations. The most computationally expensive operation for solvers 
and preconditioners alike is the multiplication of sparse matrices with an input 
vector. 

PETSc represents distributed MPI matrices by dividing them into diagonal 
and off-diagonal parts, which on each process are stored as sequential matrices. 
The diagonal sub-matrix hereby corresponds to the part of the input vector that 
is stored locally by the process. As a consequence of this storage strategy, as 
shown in Fig. 1, the matrix- vector multiplication is implemented in two phases: 

— First, each process multiplies its diagonal sub-matrix with the local elements 
of the input vector, while vector elements that reside off-process are gathered 
into the local memory of the executing process. 

— Off-diagonal matrix elements are then multiplied with the formerly remote 
vector elements and added to the partial solution. 

1.2 Related Work 

Sparse matrix multiplication is one of the most heavily used kernels in scientific 
computing and has therefore received attention from several groups [7, 9, 11, 15]. 
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(a) Diagonal matrix elements are multi- (b) The off-diagonal sub-matrix is then 

plied with the local part of the vector multiplied with a local copy of the gath- 

while remote vector elements are gath- crcd vector elements and added to the 

ered. partial solution. 



Fig. 1: Parallel sparse matrix- vector multiplication using 8 MPI processes. 

Multiple storage formats, optimisation strategics and even auto-tuning frame- 
works exist to improve spMVM performance on a wide range of multi-core archi- 
tectures [15]. On modern HFC architectures hybrid programming methods are 
being investigated to better utiUse the hierarchical hardware design by reducing 
communication needs, memory consumption and improved load balance [11]. In 
particular, task-based threading methods have been highlighted by several re- 
searchers, where dedicated threads can be used to overlap MPI communication 
with local work [11, 13, 14]. 

2 Hybrid MPI/OpenMP Parallelism 

Multi-core processors are now ubiquitous in HPC and programmers are effec- 
tively presented with two levels of parallelism: inside a compute node, cores 
share a contiguous memory address space and they can exchange information 
by directly manipulating this memory space; between nodes, distributed mem- 
ory parallelism is most commonly implemented using explicit message passing 
via MPI. Exposing and expressing both intra- and inter-node parallelism can be 
achieved using a hybrid programming approach. 

One motivation for moving away from MPI-only parallelised applications is 
given by memory limitations. While the number of cores is steadily increasing 
in modern HPC architectures, the memory available to each core is decreas- 
ing [11]. By exploiting thread-level parallelism, the same number of cores can 
be utilised within a single node while reducdng the MPI memory footprint [4]. 
For scientific applications based on domain decomposition, reducing the MPI 
process granularity also reduces data replication due to halos or ghost cells. 

Performance gains may also be expected from using fewer MPI processes, 
since it not only reduces communication overheads, but also improves the load 
balance between individual processes [11, 13]. However, reducing process-level 
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imbalance may have a negative effect on the load balance among threads, which 
in turn can be compensated for by node-level scheduling strategies, as discussed 
in Sec. 2.3. 

2.1 NUMA Architecture 

Non-Uniform Memory Access (NUMA) refers to multiprocessor systems whose 
memory is divided into multiple memory nodes. This architecture was designed 
to overcome the scalability limits of the symmetric multiprocessing (SMP) ar- 
chitecture. However, this hierarchical memory model for multi-core processors 
means that it takes longer for a process or thread to access some parts of the 
memory than others. 

It is therefore important to consider data locality in threaded applications, 
since regular off-domain memory access can be detrimental to the performance 
of already memory-bound applications. In order to minimise bus contention a 
parallel first touch memory initialisation is often used on NUMA architectures to 
bind data to the memory bank that is closest to the core subsequently using the 
data block [11]. In addition, thread and process pinning is required to optimise 
memory utilisation for all bandwidth-bound algorithms. 

When multiplying sparse matrices a master-only approach is most often used 
to parallelise the local computation steps using threads (see Sec. 1.1). However, 
threaded spMVM across multiple NUMA domains requires random but frequent 
off-domain memory access to fetch input vector elements. In order to avoid 
the high-latencies associated with off-domain data fetch NUMA domains can 
be treated as single address spaces connected by multiple MPI tasks within 
a compute node. This approach restricts threads to accessing a single NUMA 
domain as demonstrated in Sec. 4.1. 

2.2 MPI-Communication Overlap 

As described in Sec. 1.1, PETSc splits parallel spMVM into two phases in or- 
der to allow the multiplication of the diagonal submatrix to be overlapped with 
the MPI communication required to fetch off-core vector elements. Nevertheless, 
Schubert ct al. [13] showed that few MPI implementations provide truly asyn- 
chronous communication and significant performance gains can be achieved by 
using task-based threading, where a single thread is dedicated to actively per- 
form the localisation of global vcx-tor c;lc!nients. This approach not only overlaps 
MPI transfer latencies with computation but also hides any sequential overhead 
incurred from moving data to and from the required MPI buffer space. 

Task-based threading stands in contrast to traditional vector-based thread- 
ing, where all threads share the computational load evenly. In order to utilise the 
task-based variant the thread-parallel section needs to be lifted to enclose the vec- 
tor scattcr-gathcr operation. This prohibits the use of OpenMP parallel for 
pragmas to distribute the local row-wise computation among threads and re- 
quires the explicit computation of thread partition boundaries. 
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2.3 Thread- level Load Balance 

Traditional vector-based threading with OpenMP divides the number of matrix 
rows approximately evenly among threads by applying parallel for pragmas 
to the outer loop. This, however, ignores the fact that individual rows may in- 
cur varying amounts of computational work, creating a potential load imbalance 
within individual thread groups. Instead, thread-level load balance may be im- 
proved statically by dividing the number of non-zeros approximately equally 
between threads, as pointed out by Williams et al. [15]. 

It is important to note that the matrix stencil does not change during the 
solve. Thus, an explicit thread partitioning scheme may be computed after the 
matrix has been assembled and cached with the matrix object. This turns the 
load balance optimisation into a one-off cost, allowing, in principle, the use of 
load balancing schemes of arbitrary complexity. 

The method used in this paper starts with an initial greedy allocation, where 
each worker thread receives a block of contimious rows. This is followed by an 
iterative local diffusion algorithm, which fiirther balances the number of non- 
zeros allocated to each thread. This procedure balances the thread-level work 
load and memory bandwidth requirement according to floating point operations 
required for the solution. 

3 Benchmcirk 

The matrices used for benchmarking the hybrid MPI/OpenMP implementations 
have been generated by Fluidity from a global baroclinic ocean simulation, which 
is representative of a range of three-dimensional multi-scale oceanographic prob- 
lems [10]. The unstructured mesh is based on two-dimensional high-resolution 
coastline data that is extruded vertically using constant spacing. By changing 
the vertical resolution of the extruded mesh the size of the problem can be scaled 
linearly, allowing a controlled quasi-linear increase in work load for the extracted 
matrices. 

The benchmark matrices used in this work are pressure field solves extracted 
after five timesteps. The resulting matrices are solved using the Conjugate Gra- 
dient method with a Jacobi preconditioner and the number of iterations was 
limited to 10, 000. 

3.1 Cray XE6 

One of the benchmarking systems used for the work presented here is HECToR, a 
Cray XE6 based on the AMD Opteron 6200 Interlagos processor series and Crays 
Gemini interconnect [1]. The Interlagos compute nodes are based on two AMD 
Bulldozer processors, each with 16 cores at 2.3 GHz paired into two modules and 
a peak memory bandwidth of 51.2 GB/s. Each module has its own associated 
memory bank, resulting in four separate memory nodes per compute node [8]. 
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3.2 Fujitsu PRIMEHPC FXIO 

The second benchmarking system available to us is a 96-node Fujitsu PRIMEHPC 
FXIO system [3]. The PRIMEHPC FXIO is a UMA (Uniform Memory Access) 
architecture based on the SPARC64 IXfx processor. A single compute node has 
16 cores at 1.848 GHz and a peak memory bandwidth of 85 GB/s. 

4 Results 

In this section we evaluate the parallel performance of the different hybrid sp- 
MVM approaches detailed in Sec. 2. Since hybrid programming offers a complex 
set of choices on how to utilise a given hardware set, we start our investigation 
by analysing various process-to-thread ratios for fixed numbers of cores. This 
provides insights into the resource utilisation of each algorithm and provides 
an estimate for the best hybrid configuration to be used during the subsequent 
strong scalability study on large numbers of compute nodes. 

4.1 Hardv^fare Utilisation 

Figure 2 shows the performance of varying hybrid process-thread combinations 
on the Cray XE6 and Fujitsu PRIMEHPC FXIO systems. The left-most entry of 
the vector-based configuration constitutes the MPI-only baseline configuration. 
OpenMP overheads have been verified to be negligible for the given problem size 
using microbenchmarks [12]. 

On the XE6, using only a small number of compute nodes (Fig. 2a and 2b), 
the task-based algorithms with and without explicit thread-level load balancing 
perform best when running 8 threads wrapped by 4 MPI processes per node. This 
correlates with NUMA alignment, where threads are used only inside individual 
NUMA domains and MPI tasks connect separate memory nodes. A significant 
performance reduction can then be observed with 16 and 32 threads per process, 
which coincides with NUMA traffic being incurred due to fetching input vector 
elements (see Sec. 2.1). 

However, using 4096 cores (128 XE6 nodes, Fig. 2c), the task-based mode 
without explicit load balancing seems to defy the slowdown due to NUMA traffic 
when using 16 and 32 threads per process. We can conclude that the algorithm is 
now bound by memory bandwidth rather than latency. In contrast, the thread- 
balancing mode still experiences a latency slowdown, but exhibits superior per- 
formance with a NUMA-aligned configuration. This is due to an imbalance in 
vector elements required by each thread due to the explicit thread-balancing, 
which aggravates the algorithm's sensitivity to memory latency. 

Furthermore, both task-based modes significantly outperform the vector- 
based threading approach on 4096 cores, demonstrating the performance loss 
due to MPI communication ovcadieads. Although vector-based threading pro- 
vides better performance on small numbers of cores due to having an extra 
worker thread, on large numbers of compute nodes the approach struggles to 
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Fig. 2: Matrix multiplication run times on a fixed number of cores with varying 
thread-to-process ratios. The left most value represents a close approximation 
to MPI-only performance. Native compilers were used on both architectures. 
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utilise the given memory bandwidth with an increasing number of threads. As 
shown in Fig. 2c, performance is greatest with only two threads per process, 
indicating that the algorithm's performance is communication-bound. 

On the PRIMEHPC FXIO system, we observe similar scaling properties and 
resource limitations with an increasing number of processing cores for all three 
algorithms. Although the test system used for this work was limited to 1536 
cores, we can, therefore, infer an estimate of the the scaling behaviour of the 
PRIMEHPC FXIO architecture for large scale systems. 

The key difference to the XE6 is that PRIMEHPC FXIO is a UMA ar- 
chitecture, and therefore does not incur memory latency penalties due to using 
multiple memory nodes per thread group. This can be observed in Fig. 2a where, 
in contrast to the XE6, the task-based mode without thread balancing improves 
performance steadily with increasing numbers of threads per node. However, the 
same memory latency limitation on small numbers of cores affects the thread- 
balancing mode. 

On 1024 cores (64 PRIMEHPC FXIO nodes. Fig. 2b), the profiles exhibit 
properties similar to the 4096-core XE6 results. The vector-based mode is limited 
by inter-process communication and performs best with two threads per process, 
while the overall best performance is achieved by the thread-balancing approach 
using eight threads per process. 

4.2 Strong Scaling 

In this section we analyse the strong scalability of the described hybrid algo- 
rithms on the Cray XE6 system and compare their performance to a pure-MPI 
approach. All hybrid modes were run using four MPI processes per compute 
node with eight threads each in order to prevent NUMA traffic due to input 
vector elements (see Sec. 2.1). 

The matrix used in Fig. 3 has 13,491,933 degrees-of-freedom (DoF) and 
371,102,769 non-zero elements and was generated by a parallel Fluidity sim- 
ulation decomposed into 1024 sub-domains. For the hybrid modes the number 
of MPI processes used in the strong limit therefore matches the number of pro- 
cesses used during the original decomposition. For more than 1024 cores, how- 
ever, the pure-MPI mode uses more processes than the matrix was originally 
optimised for, resulting in a potential slowdown due to load imbalance. There- 
fore, an equivalent matrix which has been optimised for 8192 MPI processes has 
also been included in the benchmark (dashed line). 

At the low end of the scaling curve no significant performance differences 
can be noted. For more than 512 cores (16 XE6 nodes) the task-based hybrid 
methods show a better scalability over the vector-based approach. The thread- 
balancing implementation hereby performs best, maintaining a nearly constant 
parallel efficiency of > 88% between 512 and 2048 cores, and even experiences 
slightly super linear scaling between 1024 and 2048 cores. 

On the same matrix, the pure-MPI performance decreases significantly faster 
than the hybrid algorithms for more than 512 cores (16 XE6 nodes). The equiva- 
lent MPI runs using a more finely decomposed matrix, on the other hand, closely 
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match the performance of the task-based mode without thread-balancing up to 
2048 cores. However, in the strong limit the thread-balancing mode outperforms 
the optimised MPI runs. 

Furthermore, between 2048 and 4096 cores (64 and 128 XE6 nodes) we ob- 
serve strong super linear scaling for both task-based methods. Since the final 
runtime in the strong limit is below 4 seconds, we can deduce that scalability 
ceases at this point due to a lack of computational work and that the super 
linear scaling effects are due to favourable cache effects. 




32 64 128 256 512 1024 2048 4096 8192 

No. of Cores 




Fig. 3: Strong scaling results for the pressure matrix on up to 256 XE6 nodes 
(8192 cores). All hybrid modes use 4 MPI ranks per node and 8 threads per 
rank. 



Fig. 4 shows scalability on up to 32,768 cores (1024 XE6 nodes) when the 
workload of the matrix multiplication is increased by a factor of 4 by changing 
the vertical extrusion of the parent mesh (see Sec. 3). This matrix has 52,040,313 
DoF and 1,462,610,289 non-zeros and is based on a 4096-domain partitioning. 
The results follow the same general trend, with significant differences in per- 
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formance observable in the strong end of the scalabihty curve. The pure-MPI 
performance starts to deteriorate earlier and the super linear scaling in the high 
end is more pronounced for all approaches. 




256 512 1024 2048 4096 8192 16384 32768 

No. of Cores 




%6 512 1024 2048 4096 8192 16384 32768 

No. of Cores 

Fig. 4: Strong scaling results for a larger pressure matrix on up to 1024 XE6 
nodes (32768 cores). All hybrid modes use 4 MPI ranks per node and 8 threads 
per rank. Runs with less than 256 cores (8 XE6 nodes) have been omitted due 
to insufficient memory per MPI process. 



5 Summary and Discussion 

In this paper we present an analysis of the scaling properties of sparse matrix- 
vector multiplication using a hybrid MPI/OpenMP extension to the PETSc 
library. We compare hybrid vector-based and task-based algorithms with a pure- 
MPI variant using large matrices generated by Fluidity. We describe an extension 
to the traditional task-based approach, where the load balance among threads 
is optimised a-priori according to the number of non-zeros in each row. 
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The thread-balancing extension is shown to give superior performance when 
scaled to large numbers of compute nodes on a Cray XE6 system and on mod- 
erate numbers of nodes of a Fujitsu PRIMEHPC FXIO system. The algorithm 
achieves this by improving the memory bandwidth utilisation within a given 
compute node and by hiding MPI communication latency. This comes at the 
cost of increased memory latency effects on small numbers of cores, since the 
algorithm creates an imbalance in input vector elements per thread. However, 
once the main resource limitation of the algorithm shifts to memory bandwidth 
the thread-balancing approach can improve performance significantly. 

Furthermore, the thread-balancing approach enhances one of the fundamen- 
tal advantages of hybrid programming: By reducing the number of MPI processes 
the inliercint load imbalance among processes is reduced at the expense of load 
imbalance among threads. This is desirable, however, since we can deal with the 
thread imbalance explicitly by caching an optimised thread partitioning with 
the matrix. As a result, this approach improves work load balance and mem- 
ory bandwidth utilisation at the compute node level in order to increase overall 
performance. 
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